feat(chat): Add durable agent continuation queue#470

Merged
dcramer merged 45 commits into main from feat/durable-agent-continuation on Jun 3, 2026
Conversation

@dcramer dcramer commented Jun 1, 2026

Member

Move production Slack turn execution to a durable conversation mailbox with Vercel Queue wake-ups. This lets timed-out or vanished serverless workers recover through heartbeat-driven continuation instead of waiting for another user message.

Durable Execution

Adds conversation work state, leases, check-ins, queue callbacks, and heartbeat repair for expired leases or stranded mailbox work. The queue callback is exposed as /api/internal/agent/continue and carries only conversationId.
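As a rough illustration of the lease and check-in idea described above, here is a minimal in-memory sketch. The `LeaseTable` class, its method names, and the TTL handling are hypothetical and are not taken from this PR; the real store is not shown in this conversation.

```typescript
// Hypothetical in-memory model of a per-conversation lease.
// Times are passed in explicitly so the behavior is easy to test.
interface Lease {
  ownerId: string;
  expiresAt: number; // epoch milliseconds
}

class LeaseTable {
  private readonly ttlMs: number;
  private readonly leases = new Map<string, Lease>();

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  /** Take the lease if it is free, expired, or already ours. */
  acquire(conversationId: string, ownerId: string, now: number): boolean {
    const current = this.leases.get(conversationId);
    if (current && current.expiresAt > now && current.ownerId !== ownerId) {
      return false; // another worker holds a live lease
    }
    this.leases.set(conversationId, { ownerId, expiresAt: now + this.ttlMs });
    return true;
  }

  /** Extend the lease; false means ownership was lost. */
  checkIn(conversationId: string, ownerId: string, now: number): boolean {
    const current = this.leases.get(conversationId);
    if (!current || current.ownerId !== ownerId || current.expiresAt <= now) {
      return false;
    }
    current.expiresAt = now + this.ttlMs;
    return true;
  }

  /** What a heartbeat could scan for: conversations with expired leases. */
  expired(now: number): string[] {
    return Array.from(this.leases.entries())
      .filter(([, lease]) => lease.expiresAt <= now)
      .map(([conversationId]) => conversationId);
  }
}
```

A worker that stops checking in simply lets its lease expire, at which point a heartbeat scan like `expired(now)` can find the conversation and send a new wake-up carrying only its `conversationId`.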

Slack Cutover

Slack webhooks now normalize inbound events into durable mailbox records and wake the worker. Routine timeout/cooperative continuation no longer posts visible thread notices; progress remains owned by assistant status and reportProgress.

Specs And Verification

Documents the task execution contract, updates resumability and Slack delivery specs, refreshes Pi agent integration references, and adds integration coverage for mailbox execution, heartbeat recovery, Slack ingress, timeout continuation, and Vercel queue config.

vercel Bot commented Jun 1, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| junior-docs | Ready | Preview, Comment | Jun 3, 2026 1:53pm |


Comment thread packages/junior/src/chat/task-execution/slack-work.ts Outdated
Comment thread packages/junior/src/chat/task-execution/slack-work.ts Outdated
dcramer added a commit that referenced this pull request Jun 2, 2026
Keep queued Slack mailbox records pending until the Slack runtime handoff succeeds.

Mark only processed records as injected after the handler returns.

Complete successful Slack handlers even after the soft deadline has elapsed.

This avoids duplicate queue nudges that can replay injected work.

Refs GH-470

Co-Authored-By: GPT-5 Codex <codex@openai.com>
Comment thread packages/junior/src/chat/task-execution/store.ts Outdated
Comment thread packages/junior/src/chat/runtime/reply-executor.ts
Comment thread packages/junior/src/chat/task-execution/worker.ts Outdated
dcramer added a commit that referenced this pull request Jun 2, 2026
Preserve runnable state across leased queue work, process pending Slack mailbox records before timeout resumes, and avoid replaying already-injected Slack work after recovery.

Add replay protection for already-delivered Slack replies and a high-water timeout slice cap so pathological continuations fail instead of scheduling forever.

Refs GH-470
Co-Authored-By: GPT-5 Codex <codex@openai.com>
Comment thread packages/junior/src/chat/task-execution/store.ts Outdated
@dcramer dcramer marked this pull request as ready for review June 2, 2026 08:57
dcramer added a commit that referenced this pull request Jun 2, 2026
Preserve runnable conversation state when a leased worker completes after needsRun was marked during execution. This keeps continuation and late work recovery from waiting on heartbeat repair.

Scope worker and heartbeat queue idempotency keys to a specific wake-up attempt so provider dedupe cannot suppress later legitimate recovery nudges.

Move deterministic worker, lease, mailbox, and timeout-resume coverage into component tests and document the layer boundary.

Refs GH-470

Co-Authored-By: Codex GPT-5 <noreply@openai.com>
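To make the idempotency-key change above concrete, here is a small sketch under assumptions: the key format, the `wakeUpKey` helper, and the `DedupingQueue` stand-in for provider-side dedupe are all hypothetical and do not come from the PR.

```typescript
// Hypothetical key format: source + conversation + attempt number.
type WakeSource = "worker" | "heartbeat";

function wakeUpKey(source: WakeSource, conversationId: string, attempt: number): string {
  return `${source}:${conversationId}:${attempt}`;
}

// Toy queue that drops any send whose idempotency key it has already seen,
// standing in for dedupe performed by a queue provider.
class DedupingQueue {
  private readonly seen = new Set<string>();
  readonly delivered: string[] = [];

  send(idempotencyKey: string, conversationId: string): boolean {
    if (this.seen.has(idempotencyKey)) return false; // suppressed as a duplicate
    this.seen.add(idempotencyKey);
    this.delivered.push(conversationId);
    return true;
  }
}
```

If the key were only the conversation id, every later nudge for the same conversation would be suppressed. Including the attempt means a retry of the same attempt is still deduplicated, while a new recovery attempt gets through.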
Comment thread packages/junior/src/chat/ingress/slack-webhook.ts
dcramer added a commit that referenced this pull request Jun 2, 2026
Treat app_home_opened view publishing as a best-effort side effect so transient Slack API failures do not make the Events API webhook return 500 and trigger repeated Slack retries.

Add a Slack webhook integration test that drives a signed app_home_opened event through the real ingress path while views.publish fails.

Refs GH-470

Co-Authored-By: Codex GPT-5 <noreply@openai.com>
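The best-effort pattern described in this commit can be sketched as follows. The `bestEffort` helper and `handleAppHomeOpened` are hypothetical names for illustration, and `publishHomeView` stands in for whatever call wraps Slack's `views.publish`; none of this is the PR's actual code.

```typescript
type Logger = (message: string) => void;

// Run a side effect and swallow its failure, reporting it through the logger.
async function bestEffort(
  label: string,
  effect: () => Promise<void>,
  log: Logger = console.warn,
): Promise<boolean> {
  try {
    await effect();
    return true;
  } catch (error) {
    log(`${label} failed: ${String(error)}`);
    return false;
  }
}

// Hypothetical handler: always acknowledges the event with 200,
// even when publishing the home view fails.
async function handleAppHomeOpened(
  publishHomeView: () => Promise<void>,
  log: Logger = console.warn,
): Promise<number> {
  await bestEffort("views.publish", publishHomeView, log);
  return 200;
}
```

The design choice is that a failed home view is cheaper than a 500 response, which would cause Slack to retry the same event.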
Comment thread packages/junior/src/chat/task-execution/slack-work.ts
dcramer added a commit that referenced this pull request Jun 2, 2026
Derive restored Slack thread subscription context from the promoted batch route used for dispatch. Mixed queued batches now pass a mention-context thread into the mention handler instead of inheriting subscribed context from the latest message metadata.

Add a component regression that persists a mention plus a subscribed follow-up, verifies the mixed mailbox routes, and checks the restored thread context observed by the runtime.

Refs GH-470

Co-Authored-By: Codex GPT-5 <noreply@openai.com>
Comment thread packages/junior/src/chat/ingress/slack-webhook.ts
Comment thread packages/junior/src/chat/task-execution/vercel-callback.ts Outdated
@dcramer dcramer force-pushed the feat/durable-agent-continuation branch from 505841e to 2e8846d June 2, 2026 14:25
dcramer added a commit that referenced this pull request Jun 2, 2026
Update the rebased lockfile so Vercel Queue peer resolution covers the dashboard and example workspace entries now present on main.

Refs GH-470
Co-Authored-By: Codex GPT-5 <noreply@openai.com>
Comment thread packages/junior/src/chat/task-execution/slack-work.ts Outdated
Comment thread packages/junior/src/chat/task-execution/slack-work.ts Outdated
dcramer added a commit that referenced this pull request Jun 2, 2026
Move Slack event callback processing behind waitUntil so Slack receives a fast acknowledgement while durable mailbox work still runs through the existing handoff path.

Give Vercel Queue visibility a buffer beyond the function timeout to avoid redelivery racing host teardown.

Refs GH-470
Co-Authored-By: Codex GPT-5 <noreply@openai.com>
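The visibility buffer described in this commit amounts to simple arithmetic, sketched below. The function name and the example numbers are assumptions for illustration; the PR's actual timeout and buffer values are not shown here.

```typescript
// Hypothetical helper: a queue message should stay invisible for longer than
// the function can run, so redelivery cannot start while the host is still
// tearing down the previous invocation.
function queueVisibilitySeconds(
  functionMaxDurationSeconds: number,
  bufferSeconds: number,
): number {
  if (functionMaxDurationSeconds <= 0 || bufferSeconds <= 0) {
    throw new RangeError("durations must be positive");
  }
  return functionMaxDurationSeconds + bufferSeconds;
}

// Example with placeholder values: a 300 s function limit plus a 30 s buffer.
const exampleVisibility = queueVisibilitySeconds(300, 30);
```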
dcramer and others added 26 commits June 3, 2026 14:10
Add a component regression that sends a Slack message_changed request through the durable webhook and queued worker path. This keeps edited mention coverage on the new mailbox architecture, not only the legacy Chat SDK webhook path.

Refs GH-470
Co-Authored-By: GPT-5 Codex <codex@openai.com>
Use the active turn deadline budget in timeout errors and timeout telemetry. This keeps resumed turns with shorter host request deadlines from reporting the configured maximum instead of the operative timeout.

Refs GH-470
Co-Authored-By: GPT-5 Codex <codex@openai.com>
Treat tool results, not steering user messages, as the terminal assistant output boundary. This prevents mid-turn steering from truncating assistant text that belongs to the same finalized reply.

Refs GH-470
Co-Authored-By: GPT-5 Codex <codex@openai.com>
Apply Pi steering inside the durable injection callback so a steering failure rejects before mailbox records are marked injected. This keeps Slack follow-ups pending for a later worker instead of silently dropping them.

Refs GH-470
Co-Authored-By: GPT-5 Codex <codex@openai.com>
Thread durable worker yield checks into Pi safe boundaries so Slack turns pause before starting another model iteration when the worker soft deadline has elapsed.

Carry host request deadlines into resumed Slack turns and skip heartbeat timeout-resume recovery when the persisted conversation no longer marks that session as the active turn.

Refs GH-470
Co-Authored-By: GPT-5 Codex <codex@openai.com>
Persist soft-yield boundaries with a distinct yield resume reason so routine worker continuation does not consume timeout resume slices.

Bubble cooperative yield through Slack runtime handling so the generic conversation worker releases the lease and requeues the next slice at the durable worker boundary.

Refs GH-470
Co-Authored-By: GPT-5 Codex <codex@openai.com>
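A minimal sketch of why a distinct yield reason matters, assuming a per-turn counter of timeout slices and a cap like the one an earlier commit in this PR describes. The type names, field names, and `planResume` are hypothetical.

```typescript
type ResumeReason = "timeout" | "yield";

interface TurnBudget {
  timeoutSlicesUsed: number;
  maxTimeoutSlices: number;
}

type ResumeDecision =
  | { kind: "resume"; budget: TurnBudget }
  | { kind: "fail"; reason: string };

function planResume(budget: TurnBudget, reason: ResumeReason): ResumeDecision {
  if (reason === "yield") {
    // Cooperative yields are routine and leave the timeout budget untouched.
    return { kind: "resume", budget };
  }
  if (budget.timeoutSlicesUsed >= budget.maxTimeoutSlices) {
    // Pathological turns fail instead of scheduling forever.
    return { kind: "fail", reason: "timeout slice cap reached" };
  }
  return {
    kind: "resume",
    budget: { ...budget, timeoutSlicesUsed: budget.timeoutSlicesUsed + 1 },
  };
}
```

With this split, a long turn that yields many times at safe boundaries never approaches the cap, while a turn that repeatedly times out still fails after a bounded number of slices.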
Mark expired conversation leases runnable so recovered queue nudges can reach continuation scanning even when mailbox messages were already injected.

Include cooperative yield records in stale continuation heartbeat recovery.

Co-Authored-By: GPT-5 Codex <codex@openai.com>
Treat a timeout or yield continuation summary without a valid resume request as a worker failure instead of completed idle work.

This keeps the conversation runnable for queue retry or heartbeat repair instead of clearing needsRun after recovered idle work.

Co-Authored-By: GPT-5 Codex <codex@openai.com>
Make Slack resume startup report whether a continuation actually started so idle durable work can distinguish a real resume from a stale no-op.

Terminalize invalid or skipped awaiting timeout/yield sessions before completing idle work, and cover both invalid metadata and stale active-turn mismatch with component tests.

Co-Authored-By: GPT-5 Codex <codex@openai.com>
When a new Slack message arrives while a previous turn is awaiting resume, schedule the old continuation without marking the new message as replied.

This keeps the follow-up available for steering or the next handled turn instead of silently treating it as answered.

Co-Authored-By: GPT-5 Codex <codex@openai.com>
Return a model-visible MCP tool error when a resumed slice asks for a tool name that is no longer present in the rebuilt provider catalog. Include recovery guidance so the agent can refresh the provider catalog before retrying.

Fixes GH-492

Co-Authored-By: GPT-5 Codex <codex@openai.com>
Only mark Slack mailbox records injected after the runtime has durably persisted the turn handoff. Preserve pending mailbox work when a handler only reschedules an awaiting continuation, while still marking already-handled early replies after their thread state is saved.

Co-Authored-By: GPT-5 Codex <codex@openai.com>
Prefer awaiting turn continuation recovery before routing pending Slack mailbox records. When a continuation starts, leave pending mail runnable so the queue re-drives it after the active turn finishes instead of looping through a no-handoff Slack handler path.

Co-Authored-By: GPT-5 Codex <codex@openai.com>
Sign Vercel conversation work queue payloads with JUNIOR_SECRET and verify the decoded callback payload before processing work. This keeps the public queue route from executing forged conversation work even outside Vercel trigger enforcement.

Co-Authored-By: GPT-5 Codex <codex@openai.com>
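For readers unfamiliar with signed callbacks, here is a minimal HMAC sketch using Node's `node:crypto` module. The helper names, the hex encoding, and the choice of SHA-256 are assumptions; the PR only states that payloads are signed with `JUNIOR_SECRET` and verified before work is processed.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical helper: sign the serialized payload with a shared secret.
function signPayload(payload: string, secret: string): string {
  return createHmac("sha256", secret).update(payload).digest("hex");
}

// Hypothetical helper: recompute the signature and compare in constant time.
function verifyPayload(payload: string, signature: string, secret: string): boolean {
  const expected = Buffer.from(signPayload(payload, secret), "hex");
  const received = Buffer.from(signature, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

In a real deployment the secret would come from the environment rather than being passed around as a literal, and the callback would reject the request before touching any conversation state when verification fails.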
Keep failed continuation persistence from becoming a generated assistant error reply, while preserving the terminal timeout slice-cap record as failed state.

Propagate Slack handoff lost-lease results through the conversation worker so downstream completion does not treat the run as successful.

Co-Authored-By: GPT-5 Codex <codex@openai.com>
Handle already-replied Slack deliveries before rescheduling an active continuation so duplicate events complete mailbox handoff without nudging the old turn again.

Keep auth-pause persistence failure behavior aligned with the existing provider-error contract while preserving the continuation failure fixes for yield and timeout paths.

Co-Authored-By: GPT-5 Codex <codex@openai.com>
Worker errors already marked the conversation runnable before releasing its lease, but a recent enqueue marker could make heartbeat defer recovery. Send a fresh wake-up nudge for failed runner slices so runnable work is redelivered promptly.

Co-Authored-By: GPT-5 Codex <codex@openai.com>
Cooperative yields could snapshot Pi state before queued steering messages appeared in agent.state.messages. Keep the latest safe boundary candidate available so yielded, timed-out, and auth-paused resumes do not overwrite a longer steering-aware transcript.

Co-Authored-By: GPT-5 Codex <codex@openai.com>
Bound signed conversation-work callbacks to a small timestamp window so old queue payloads cannot be replayed indefinitely.

When a Slack follow-up arrives during an active parked turn, keep auth pauses parked and fail malformed awaiting continuations before accepting new work. This prevents a fresh turn from replacing the durable session state.

Co-Authored-By: GPT-5 Codex <codex@openai.com>
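The timestamp window mentioned in this commit can be illustrated with a small freshness check. The function name, the symmetric treatment of clock skew, and the five-minute example window are assumptions, not values from the PR.

```typescript
// Hypothetical check: accept a signed callback only if its timestamp is
// within `windowMs` of the current time, in either direction.
function isFreshCallback(signedAtMs: number, nowMs: number, windowMs: number): boolean {
  if (!Number.isFinite(signedAtMs)) return false; // missing or malformed timestamp
  return Math.abs(nowMs - signedAtMs) <= windowMs;
}

// Placeholder window for the example: five minutes.
const EXAMPLE_WINDOW_MS = 5 * 60 * 1000;
```

For this to prevent replay, the timestamp has to be part of the signed payload. Otherwise an attacker could attach a fresh timestamp to an old, validly signed body.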
Preserve rapid Slack follow-ups through durable conversation work and tighten timeout-resume scheduling.

Add shared test adapters for queue, Slack webhook, Slack outbox, waitUntil, and signed resume requests so tests exercise real boundaries with less mocking.

Document Django-inspired test adapter principles and stabilize slow Slack integration timeouts under clustered runs.

Refs GH-470

Co-Authored-By: GPT-5 Codex <codex@openai.com>
Replace automatic processing eyes with a completion check when Slack turns finish.

Leave parked, skipped, and failed turns without completion so reactions match the lifecycle.

Restore requester credential context when timeout resumes rebuild reply context after the rebase.

Co-Authored-By: GPT-5 Codex <codex@openai.com>
Report whether running turn checkpoints actually persist before committing durable Slack mailbox input.

Propagate lost-input ownership errors instead of allowing a successful turn without mailbox commit.

Co-Authored-By: GPT-5 Codex <codex@openai.com>
Keep runnable conversation ids in the heartbeat recovery index when pruning overflow entries.

Treat failed worker lease check-ins as lost ownership so in-flight work cannot complete after lease loss.

Co-Authored-By: GPT-5 Codex <codex@openai.com>
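One way to picture the pruning rule in this commit is the sketch below. The `IndexEntry` shape and `pruneIndex` are hypothetical, and the policy of dropping the oldest idle entries first is an assumption; the PR only says that runnable conversation ids are kept when overflow entries are pruned.

```typescript
interface IndexEntry {
  conversationId: string;
  runnable: boolean;
  updatedAt: number; // epoch milliseconds
}

// Keep every runnable entry; fill any remaining room with the most recently
// updated idle entries. Runnable entries are kept even if they exceed the cap.
function pruneIndex(entries: IndexEntry[], maxEntries: number): IndexEntry[] {
  if (entries.length <= maxEntries) return entries.slice();
  const runnable = entries.filter((entry) => entry.runnable);
  const idle = entries
    .filter((entry) => !entry.runnable)
    .sort((a, b) => b.updatedAt - a.updatedAt);
  const room = Math.max(0, maxEntries - runnable.length);
  return [...runnable, ...idle.slice(0, room)];
}
```

The point of the rule is that an index entry is the only way heartbeat recovery finds a conversation, so evicting a runnable one would strand its work.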
Keep terminal timeout resume failures on the error path after the session record reaches the slice cap. This prevents Slack delivery from treating an exhausted turn as a successful assistant reply containing an Error-prefixed message.

Add a regression that seeds the durable session at the timeout cap and verifies the runtime throws while persisting the failed terminal record.

Co-Authored-By: GPT-5 Codex <codex@openai.com>
Skip duplicate inbound retries when the conversation already has a fresh queue marker.

Repair stale or missing markers with a fresh idempotency key so failed handoffs still recover promptly.

Add fake queue attempt introspection and Slack retry coverage so duplicate sends are visible in tests.

Refs GH-470

Co-Authored-By: GPT-5 Codex <codex@openai.com>
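The marker check described in this commit can be sketched as a small decision function. `QueueMarker`, `decideEnqueue`, and the freshness threshold are hypothetical names and values used only for illustration.

```typescript
interface QueueMarker {
  enqueuedAt: number; // epoch milliseconds of the last wake-up send
}

type EnqueueDecision = "skip" | "enqueue";

// Skip the send when a recent marker shows a wake-up is already in flight;
// send again when the marker is missing or has gone stale.
function decideEnqueue(
  marker: QueueMarker | undefined,
  now: number,
  freshForMs: number,
): EnqueueDecision {
  if (marker && now - marker.enqueuedAt < freshForMs) {
    return "skip";
  }
  return "enqueue";
}
```

When the decision is `"enqueue"` for a stale or missing marker, the commit pairs the send with a fresh idempotency key so the queue provider does not dedupe the repair against the failed handoff.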
Drop unused imports left after the latest rebase conflict resolution so lint passes with warnings denied.

Refs GH-470

Co-Authored-By: GPT-5 Codex <codex@openai.com>
Comment thread packages/junior/src/chat/runtime/reply-executor.ts
When a Slack turn only reschedules an awaiting continuation, persist and commit the input hooks before returning.

This lets durable mailbox workers mark the inbound row injected without sending a visible reply.

Refs GH-470

Co-Authored-By: GPT-5 Codex <codex@openai.com>

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 91b0e39.

Comment thread packages/junior/src/chat/task-execution/worker.ts
Mark lost-lease worker exits as runnable before releasing the conversation lease so queued work can recover immediately instead of waiting for lease TTL repair.

Refs GH-470
Co-Authored-By: GPT-5 Codex <codex@openai.com>